Skip to content
This repository has been archived by the owner on Dec 3, 2020. It is now read-only.

Fix #36: Add initial Fathom rules with 98.7% average training accuracy #45

Merged
merged 9 commits into from
Aug 27, 2018

Conversation

biancadanforth
Copy link
Collaborator

@biancadanforth biancadanforth commented Aug 12, 2018

@Osmose , ready for review! I'm going to put my conclusions in #36 shortly, but basically, I think we should move ahead with Fathom with a CSS selector fallback.


Builds off #38 with a more extensive ruleset based off a training corpus (available on the Commerce Google drive in the Engineering subfolder) of 50 product pages from our top five sites.

Training:

  • 98.7% average accuracy across all 3 features for the training set (100% title, 100% image, 96% price).

Coefficients and values:

(Pulled from ./src/fathom_coefficients.json)

{
  "largerImageCoeff": 2,
  "largerFontSizeCoeff": 7,
  "hasDollarSignCoeff": 8,
  "hasPriceInIDCoeff": 17,
  "hasPriceInClassNameCoeff": 2,
  "isAboveTheFoldPriceCoeff": 33,
  "isAboveTheFoldImageCoeff": 13,
  "isNearbyImageXAxisPriceCoeff": 5,
  "isNearbyImageYAxisTitleCoeff": 5,
  "hasPriceishPatternCoeff": 15
}

Image - best accuracy: 100%

screen shot 2018-08-17 at 11 07 08 am

Price - best accuracy: 94%

screen shot 2018-08-17 at 11 08 32 am

Title - best accuracy: 100%

screen shot 2018-08-17 at 11 05 59 am

const extractedElements = {};
const rules = productRuleset.get('product').rulesetMaker;
for (const feature of PRODUCT_FEATURES) {
let fnodesList = rules([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a bit of compilation that happens on ruleset construction, so, for speed, you might want to eventually construct this once, in global scope, and keep it around.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks; makes sense. What's happening behind the scenes there?

Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in reviewing this. Nice work on getting this to an acceptable accuracy in such short time!

isAboveTheFoldImageCoeff,
isNearbyImageXAxisCoeff,
hasPriceishPatternCoeff,
]).against(doc).get(`${feature}`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This re-creates and re-runs the ruleset for each feature, right? If this is all in a single ruleset we can just run it once, can't we?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string interpolation for feature has no effect here, just pass it directly.


const SCORE_THRESHOLD = fathomCoeffs.hasPriceClass;
const PRODUCT_FEATURES = ['title', 'price', 'image'];
const SCORE_THRESHOLD = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is defined in two spots, can we define it once and import it?

extractedProduct[feature] = (feature === 'image'
? extractedElements[feature].src
: extractedElements[feature].innerText
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ternary statements aren't syntactically clear, so it's worth avoiding them unless they are clearer to read than the alternative. Same for the return statement below.

* file, You can obtain one at http://mozilla.org/MPL/2.0/. */

import {dom, out, rule, ruleset, score, type} from 'fathom-web';
import {ancestors} from 'fathom-web/utils'; // for training: utilsForFrontend
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the comment here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For training, the trainer add-on imports Fathom as a submodule, and its utilities file is called "utilsForFrontend" rather than "utils".

This is one of the two action items @erikrose was going to work on. I will make it more clear in the meantime.

|| style.display === 'none'
|| style.opacity === '0'
|| style.width === '0'
|| style.height === '0') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing large conditions like this in a variable makes it easier to read the if statement:

const isElementHidden = (
  style.visibility === 'hidden'
  || style.display === 'none'
  || style.opacity === '0'
  || style.width === '0'
  || style.height === '0'
);
if (elementIsHidden) {
  return false;
}

const imageElement = fnode._ruleset.get('image')[0].element; // eslint-disable-line no-underscore-dangle
const imageDOMRect = imageElement.getBoundingClientRect();
const deltaX = eleDOMRect.left - imageDOMRect.right;
// priceish element is always* to the right of the image
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean in the context of our top 5? I don't think we should make that assumption. Can we not just take the absolute value of the delta?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking the absolute value of the delta alone won't work, as if the element overlaps with the image element in the x-axis, it would score higher than an element that doesn't. However, I can calculate a deltaX based on whether the element is to the left or right of the image (without overlapping) to cover cases where the element is to the left of the image.

*/
function hasPriceishPattern(fnode) {
const text = fnode.element.innerText;
/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two asterisks are used for doc comments at the start of a function/class/variable. This is just a regular comment and should use //.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I automatically associate // with single line comments and /* */ with multi-line comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A /* */ comment would also be fine. A block comment that starts with two stars (/**) instead of one (/*) is specifically a doc comment and should follow JSDoc conventions, and should only occur either immediately above a variable/class/function/module that it is documenting, or be one of the few exceptions (such as type definitions).

/**
* Checks if fnode is nearby the top scoring image element in the y-axis
*/
function isNearbyImageYAxis(fnode) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the Y axis used for eligibility but the X axis is used for scoring?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Y-axis
For price, there is little correlation in vertical proximity to the image, as some pages' price is toward the top of the image, and in others, it's toward the middle. On a lot of pages, like the one below, there are lots of priceish candidates that could score higher than the correct price depending on whether I choose the center or top of the image as the reference point.

screen shot 2018-08-17 at 9 44 13 am

The title is consistently near the top of the image, so it might benefit from being scored by vertical proximity.

X-axis
For the X-axis, there is a reliable correlation between x-proximity and the correct price (see example page above). I don't use this function to score title as I did not see a correlation there (also titles can be above the image, which breaks this rule).

FWIW: I just tried making the X function a binary check like the Y function, and I got much worse accuracy (80% compared to 94% on price).
I also tried making the Y function a fuzzier measure, scoring it like the X. Unfortunately for price, on some pages it's closer to the top of the image, on others it's closer to the center or center bottom. Accuracy on price went down from 94% to 74%. Interestingly for title, the title elements are consistently closer to the top of the image, and making that change brought accuracy up from 96% to 100%.

Conclusion
In conclusion, I am going to change the binary Y-axis rule to a score-based rule for title, and I am going to leave the binary Y-axis rule as-is for price. I am also going to leave the score-based X-axis rule as-is for price, though I renamed it for clarity, since that rule is only used for price and not other features.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth leaving a short comment explaining that Y axis scoring has worse results.

const DOMRect = element.getBoundingClientRect();
const imageElement = fnode._ruleset.get('image')[0].element; // eslint-disable-line no-underscore-dangle
const imageDOMRect = imageElement.getBoundingClientRect();
if (DOMRect.top >= (imageDOMRect.top - TOP_BUFFER)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it's more readable when splitting stuff into multiple lines to not put part of what you're splitting on the first line. Maybe change this to:

if (
  DOMRect.top >= (imageDOMRect.top - TOP_BUFFER)
  && DOMRect.bottom <= imageDOMRect.bottom
) {

Or even better, store the comparison in a variable that describes what it's computing.

* Image rules
*
* If training, leave uncommented, as 'price' and 'title' rules depend
* on the `out` of these 'image' rules.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the dependency and the limitation that FathomFox can only train based on a single type, but if it's okay to leave this uncommented during a training run, why do we need to comment out the other ones during a training run?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked this, and you're right that we don't need to comment out rules in a ruleset that we're not using to train a particular feature. Will remove those comments.

Copy link
Collaborator Author

@biancadanforth biancadanforth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My replies to feedback from Osmose and erikrose.

Thank you for the review! I will file one commit for the majority of changes, and a second commit for larger structural changes recommended in this comment and this comment.

const extractedElements = {};
const rules = productRuleset.get('product').rulesetMaker;
for (const feature of PRODUCT_FEATURES) {
let fnodesList = rules([
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks; makes sense. What's happening behind the scenes there?

* file, You can obtain one at http://mozilla.org/MPL/2.0/. */

import {dom, out, rule, ruleset, score, type} from 'fathom-web';
import {ancestors} from 'fathom-web/utils'; // for training: utilsForFrontend
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For training, the trainer add-on imports Fathom as a submodule, and its utilities file is called "utilsForFrontend" rather than "utils".

This is one of the two action items @erikrose was going to work on. I will make it more clear in the meantime.

* Rulesets to train.
*
* Drop this file into the fathom-trainees/src folder (replacing the default file)
* to train Fathom against this ruleset.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After talking with Osmose in slack to clarify this, I will submit a separate commit to this PR to address this issue with more detail there.

function largerFontSize(fnode) {
const sizeWithUnits = window.getComputedStyle(fnode.element).fontSize;
const size = sizeWithUnits.replace('px', '');
if (size) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! I didn't know that; seems to check out though.

function hasPriceInID(fnode) {
const element = fnode.element;
const parentElement = element.parentElement;
const ID = element.id;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have said previously that abbreviations/acronyms should be capitalized, so that's why I have it capitalized. Perhaps it becomes confusing though when there are no other terms in the variable name? e.g. ID versus parentID?

/**
* Checks if fnode has the same innerText as any of its children
*/
function removeRedundantAncestors(fnode) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rename the method to hasDifferentInnerTextThanChildren, since that's really what's being checked.

Effectively what this does is remove elements who have the same innerText as their child, since it is used in a when clause in a rule further down in this file.

I observed that a lot of <div> and <span> elements are style wrappers on other elements. Rather than keeping all of those "redundant" innerText elements in the game and scoring them (they'd end up with the same or similar score as the innermost element), I'd like to eliminate them right at the beginning. This has two benefits:

  1. Performance: the more elements you eliminate from scoring, the less computation is needed to find the highest scoring element.
  2. Training: When I tag an element as the "correct" price element, I can only choose one for each feature. I was getting wrong accuracy numbers, since the value of the innerText was correct, but the specific element Fathom was choosing would vary from the innermost element to one of its ancestors.

const imageElement = fnode._ruleset.get('image')[0].element; // eslint-disable-line no-underscore-dangle
const imageDOMRect = imageElement.getBoundingClientRect();
const deltaX = eleDOMRect.left - imageDOMRect.right;
// priceish element is always* to the right of the image
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking the absolute value of the delta alone won't work, as if the element overlaps with the image element in the x-axis, it would score higher than an element that doesn't. However, I can calculate a deltaX based on whether the element is to the left or right of the image (without overlapping) to cover cases where the element is to the left of the image.

*/
function hasPriceishPattern(fnode) {
const text = fnode.element.innerText;
/**
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I automatically associate // with single line comments and /* */ with multi-line comments.

/**
* Checks if fnode is nearby the top scoring image element in the y-axis
*/
function isNearbyImageYAxis(fnode) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Y-axis
For price, there is little correlation in vertical proximity to the image, as some pages' price is toward the top of the image, and in others, it's toward the middle. On a lot of pages, like the one below, there are lots of priceish candidates that could score higher than the correct price depending on whether I choose the center or top of the image as the reference point.

screen shot 2018-08-17 at 9 44 13 am

The title is consistently near the top of the image, so it might benefit from being scored by vertical proximity.

X-axis
For the X-axis, there is a reliable correlation between x-proximity and the correct price (see example page above). I don't use this function to score title as I did not see a correlation there (also titles can be above the image, which breaks this rule).

FWIW: I just tried making the X function a binary check like the Y function, and I got much worse accuracy (80% compared to 94% on price).
I also tried making the Y function a fuzzier measure, scoring it like the X. Unfortunately for price, on some pages it's closer to the top of the image, on others it's closer to the center or center bottom. Accuracy on price went down from 94% to 74%. Interestingly for title, the title elements are consistently closer to the top of the image, and making that change brought accuracy up from 96% to 100%.

Conclusion
In conclusion, I am going to change the binary Y-axis rule to a score-based rule for title, and I am going to leave the binary Y-axis rule as-is for price. I am also going to leave the score-based X-axis rule as-is for price, though I renamed it for clarity, since that rule is only used for price and not other features.

* Image rules
*
* If training, leave uncommented, as 'price' and 'title' rules depend
* on the `out` of these 'image' rules.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked this, and you're right that we don't need to comment out rules in a ruleset that we're not using to train a particular feature. Will remove those comments.

@biancadanforth biancadanforth changed the title Fix #36: Add initial Fathom rules with 96.7% average training accuracy Fix #36: Add initial Fathom rules with 98.7% average training accuracy Aug 17, 2018
@biancadanforth
Copy link
Collaborator Author

@Osmose , ready for another look.

I created #61 as a follow-up issue as we discussed in Slack.

@biancadanforth biancadanforth requested a review from Osmose August 20, 2018 20:23
Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, this is very close!

I dunno why some of my comments are marked as outdated. I was trying to comment on a subset of the diff. Let me know if they're actually outdated and I'll see about updating them.


const PRODUCT_FEATURES = ['title', 'price', 'image'];
const SCORE_THRESHOLD = 4;
// Array of numbers corresponding to the coefficients
const coefficients = Object.values(defaultCoefficients);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of values returned by Object.values is implementation-specific and shouldn't be relied upon. We can't guarantee that the coefficients are returned in the same order every time.

We could:

  • Store an array in the JSON file.
  • Have a utility function that takes in the coefficient object and returns the coefficients in the correct order, perhaps as a static method on RulesetMaker.
  • Change RulesetFactory to take the coefficient object in as a parameter instead of the list. I'm not sure how we'd modify the trainees file though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I went for option two.

isNearbyImageXAxisCoeff,
hasPriceishPatternCoeff,
]).against(doc).get(`${feature}`);
let fnodesList = rules.against(doc).get(feature);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still running the ruleset against the document three separate times, but we only need to run it once and extract all three features, don't we?

if (feature === 'image') {
extractedProduct[feature] = extractedElements[feature].src;
}
extractedProduct[feature] = extractedElements[feature].innerText;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overwrites the value if the feature was image since the if statement is missing an else.

*/
constructor(coefficients) {
[
this.largerImageCoeff,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unaware you could use this in destructured arrays, huh!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it took me a while to figure out how I could use destructuring here.

const upperHeightLimit = viewportHeight * 2;
// Use a falling trapezoid function to score the element
// Taken from: https://github.com/mozilla/fathom-trainees
if (top >= upperHeightLimit) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole "falling trapezoid" function thing is still very confusing to me. I feel like comments describing what we're doing in context would make things easier:

// If the node is below the fold by more than a viewport's length, return a low score.
if (top >= upperHeightLimit) {
  return ZEROISH * featureCoeff.
}

// If the node is above the fold, return a high score.
if (top <= viewportHeight) {
  return ONEISH * featureCoeff;
}

// Otherwise, scale the score linearly between the fold and a viewport's length below it.
const slope = (ONEISH - ZEROISH) / (viewportHeight - upperHeightLimit);
return (slope * (top - upperHeightLimit) + ZEROISH) * featureCoeff;

Sorry if I'm being picky, I'm just trying to ensure I won't go blank next time I look at this. :P

src/trainees.js Outdated
[
/**
* A ruleset that finds the main product image on a product page.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc comment seems unnecessary given that the ruleset has a key and it's clear from the rest of the file that these are rulesets. The ruleset name comment can probably go as well.

src/trainees.js Outdated
coeffs,
/**
* @param {Array.number} coefficients
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably also unnecessary to comment this function, the signature of these functions is explained above already.

src/trainees.js Outdated
// The coefficients are updated over time during training, so create a new factory for
// each iteration
const rulesetFactory = new RulesetFactory(coefficients);
return rulesetFactory.makeRuleset(); // The ruleset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't really add anything either.

src/trainees.js Outdated
/**
* @param {Array.number} coefficients
*/
rulesetMaker(coefficients) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't change between the features, so we can define it once above and use it three times here.

src/trainees.js Outdated
@@ -0,0 +1,101 @@
/* This Source Code Form is subject to the terms of the Mozilla Public
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is also extraction-related, and should be placed inside the extraction folder along with the other files.

These rules successfully pull out product title, price and image from the following product pages (one each from the 5 top sites):
* [Amazon](https://www.amazon.com/KitchenAid-KL26M1XER-Professional-6-Qt-Bowl-Lift/dp/B01LYV1U30?smid=ATVPDKIKX0DER&pf_rd_p=0c7b792f-241a-4510-94f4-dd184a76f201&pf_rd_r=AZD7BGV3JZGTB23F30X3)
* [Ebay](https://www.ebay.com/p/Best-Choice-Products-650W-6-speed-5-5QT-Kitchen-Food-Stand-Mixer-with-Stainless-Steels-Bowl-Black/3018375728?iid=253733404998)
* [Walmart](https://www.walmart.com/ip/KitchenAid-Classic-Series-4-5-Quart-Tilt-Head-Stand-Mixer-Onyx-Black-K45SSOB/29474640)
* [Best Buy](https://www.bestbuy.com/site/jbl-everest-elite-750nc-wireless-over-ear-noise-cancelling-headphones-gunmetal/5840136.p?skuId=5840136)
* [Home Depot](https://www.homedepot.com/p/Husky-SAE-Combination-Wrench-Set-10-Piece-HCW10PCSAE/202934501)

TODO:
* Create a training set with FathomFox and run these rules against them to measure their accuracy for 50 product pages (10 from each top site).
* Modify trimTitle method, so it doesn't cut off the color from the title for the product on Ebay.
* Generalize formatPrice method. @Osmose, would you have any suggestions?
Product title rules previously pulled the unique 'title' element from the 'head' element on the page (part of the pages metadata). While this ostensibly requires less processing (we don't have to search the DOM or score any other elements), the title string often requires site-specific cleaning such as to remove the vendor name, and the final, cleaned up string cannot not be verified as accurate by Fathom, which only tells us if our rules picked the right element.

The alternative approach, implemented here, is to pull the title from the corresponding element in the content of the page. Since Fathom can verify that the right element was selected, and the string from this element would not require any cleaning, this approach is a much better proxy for extracting the correct product title.
This enables the same script to be used for training and running in the commerce webextension.

How to train a ruleset with Fathom:
1. Follow Fathom's [Trainer instructions](https://github.com/erikrose/fathom-fox#the-trainer).
2. Open the [Fathom Trainees](https://github.com/mozilla/fathom-trainees) add-on in a new profile.
3. Install FathomFox in that window from AMO.
4. Drag and drop the training corpus into that window.
  - Note: The training corpus are HTML files frozen using [FathomFox's DevTools panel](https://github.com/erikrose/fathom-fox#the-developer-tools-panel); our training corpus is on the shared "commerce" Google drive.
  - Note: As of the date of this commit, the Corpus Collector is not a recommended option for building a training corpus due to a `freeze-dry` dependency bug that inserts a bunch of extra garbage when re-freezing a frozen page.
5. Copy ./src/fathom_ruleset.js into fathom-trainees/src/trainees.js and save over it.
6. Choose a feature to train, 'price', 'title' or 'image', and edit `trainees.set()` so that one of those features is the first argument.
7. Comment out the rules pertaining to all but that feature.
  - Currently, you can only train one ruleset at a time with Fathom, and only one `out` (e.g. 'title', 'image' or 'product') at a time for a given ruleset.
  - If you have multiple `out`s you'd like to train simultaneously, repeat this process for the remaining features so Fathom is running in a separate browser window for each feature and its corresponding rules.
8. Click the FathomFox browserAction and select "Train"
9. Select the feature from the dropdown list and click the "Train against the tabs in this window" button.
10. The array of coefficients displayed on the training page will update over time as Fathom optimizes them; this could take a while.
These rules and coefficients yield the following accuracy based on a training corpus of 50 product pages from our top 5 sites (Amazon, Ebay, Walmart, Best Buy and Home Depot):
* 100% for product 'image'
* 96% for product 'title'
* 94% for product 'price'
Product 'price' and 'title' features have proximity rules based on the highest scoring product 'image' element. For now, this is done by accessing the image fnode using an internal '_ruleset' object; @erikrose is working on better support for this use case in the very near future, so this implementation can be improved at that time.
The first script, 'ruleset_factory.js', exports a class to create a ruleset based on a set of coefficients; instances of this class are used in production (via 'fathom_extraction.js') and for Fathom training (via 'trainees.js').
2. The second script, 'trainees.js', is used exclusively for training using the FathomFox web extension and does not ship with the commerce web extension.

Additional changes and notes:
* I chose not to make use of the 'autobind' decorator in 'ruleset_factory.js', since it is also used in the training add-on, where devDeps like 'babel-core' and 'babel-plugin-transform-decorators-legacy' do not exist.
* I also turned off an eslint rule that requires class methods to use 'this', since some methods in RulesetFactory don't require it, and it would be tedious and confusing to call some methods on the class instance and others on the class itself.
* The new training script ('trainees.js') has three elements in the map it exports, one for each product feature ('image', 'title', 'price'). This allows us to select which feature to train from a dropdown menu on FathomFox's trainer page.
* Currently, for training, four files must be copied over into the 'fathom-trainees' add-on src directory:
  * config.js
  * fathom_default_coefficients.json
  * ruleset_factory.js
  * trainees.js (overwritting the existing file)
* In a separate commit, I will put all the Fathom extraction files into an 'extraction' (or similar) subfolder.
@biancadanforth
Copy link
Collaborator Author

@Osmose Ready for you again. I replied to erikrose in the Fathom needs/aspirations doc re: the utils versus utilsForFrontend strangeness.

Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+wc, just one minor thing. Great work!

* file, You can obtain one at http://mozilla.org/MPL/2.0/. */

/* eslint-disable import/no-unresolved */
// This file is moved up a level to the ./src folder for training
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just remove extraction from the imports here? Then the file doesn't need to be moved.

Copy link
Collaborator Author

@biancadanforth biancadanforth Aug 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fathom-trainees add-on assumes trainees.js is dropped into the ./src folder. With a config.js dependency in the WebExtension, we would need to change some paths around (either the extraction subfolder or the path to config.js for training).

However, this comment made me realize my reason for using config.js was no longer necessary, so having removed it, I can put all the scripts needed for training into the top level ./src folder without this additional step! Thanks.

Also updated the Code Organization section of the README to include the new 'extraction' subfolder.
@biancadanforth biancadanforth force-pushed the 36-fathom-rules branch 2 times, most recently from 5982fae to 60d4493 Compare August 27, 2018 18:43
This const was not actually needed by more than one file, which simplifies how 'trainees.js' and its imported scripts are used for training.
@biancadanforth biancadanforth merged commit 5c50172 into master Aug 27, 2018
@biancadanforth biancadanforth deleted the 36-fathom-rules branch August 27, 2018 19:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants